FOREWORD


ABSTRACT 


This report explains some of the mathematical techniques currently being used and
some which are being considered for solving a problem of information storage and retrieval.
Basically two problem characterizations are discussed. The first is a statistical
description and the other is a vector space characterization. Specifically, we have neglected
the interesting area of linguistic analysis which is sometimes used as the basis for
information retrieval. Several examples, comments and suggestions are made regarding the
use of the various techniques.
SECTION I
INTRODUCTION 


There exist several hundred different methods for relating search requests to documents
contained in a library. It would indeed be impossible to discuss all of these (and
probably not desirable); therefore, this report shall be aimed at uncovering the basic
mathematics which provide the foundation for most of the retrieval techniques.
Specifically, this report will emphasize the mathematics of:

1.  Boolean  Algebraic  Retrieval 

2.  Linear  Statistical  Retrieval 

3.  Statistical  Association  Techniques  for  expanding  a  query  and/or  for  expanding 
the  set  of  retrieved  documents. 

4.  Vector  Space  representation  of  the  retrieval  process 

5. Discriminant Analysis Techniques

It is not intended that this list of subjects exhaust the topic of mathematics for
information retrieval. Specifically, we have neglected the very interesting area of linguistic
analysis which is sometimes used as the basis for information retrieval. However, it is
felt that the operational systems of today and those systems which will be operational in
the near future can be adequately described in terms of the mathematics presented here.

The  background  material  for  this  report  was  obtained  for  the  most  part  from  the  sources 
listed  in  the  bibliography.  The  descriptions,  examples,  comments  and  suggestions  are 
those  of  the  author. 


SECTION  II 

GENERAL  MATHEMATICAL  MODEL 


It is assumed throughout that each document and each query is characterized by a set
of identifiers which include keywords, index terms, descriptors, phrases, concepts, etc.
Furthermore, it is assumed that the necessary dictionaries, thesauri and algorithms exist
for uniquely representing a document (or query) by an appropriate subset of identifiers.*

Let:

$\{D_i\}$ = set of documents composing the library

$Q$ = query

$d_i$ = document vector defined on the set of identifiers

$q$ = query vector defined on the set of identifiers

$d$ = No. of documents; $t$ = No. of identifiers

$r$ = retrieval vector

$T$ = transformation
* Salton suggests the following techniques for generating identifiers from a document:
... may not only be necessary but may also be more productive than the computation of
higher-order statistical associations.

The following principal procedures are available for vocabulary control and
normalization:

1.  A  stem-suffix  cutoff  procedure  to  reduce  each  text  word,  or  index  term,  to  word 
stem  and  word  suffix,  thus  producing  a  common  form  for  the  many  different  words  which 
exhibit  the  same  stem  (e.g.,  analyzer,  analysis,  analytic,  analyst,  etc.). 

2.  Use  of  a  synonym  dictionary,  or  thesaurus,  to  replace  semantically  equivalent 
words  by  a  common  identifier  (or  concept  number). 

3. Use of a hierarchical subject arrangement, such as a library classification system,
capable of producing for a given concept number various types of related concepts,
including more general ones, more specific ones, and a variety of cross references.

4. Use of phrase dictionaries to perform concept groupings by combining pairs or
triples of concepts, previously included in a dictionary, into a single, more representative
entity (e.g., the concepts "programming" and "language" might be transformed into a more
meaningful unit such as "programming language").
Then the general model is

$D_i \xrightarrow{T_1} d_i$, $Q \xrightarrow{T_2} q$, $(d_i, q) \xrightarrow{T_3} r \xrightarrow{T_4}$ subset of $\{D_i\}$


Explanation  of  Transformations 

$T_1(D)$ - is a transformation on the set of all documents which maps each document
into the vector space spanned by the identifiers.

$T_2(Q)$ - is a transformation on the set of all queries which maps every query into the
vector space spanned by the identifiers.

$T_3(d_i, q)$ - is a transformation on the set of all document vectors and a query vector which
generates a retrieval vector designated $r$.

$T_4(r)$ - is a transformation on the retrieval vector which generates a subset of the set
of all documents.

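The contents of storage are represented by a C matrix of d row vectors, i.e.

$$C = \begin{bmatrix} d_1^T \\ d_2^T \\ \vdots \\ d_d^T \end{bmatrix} \quad (d \text{ rows},\ t \text{ columns})$$

Since there are d documents each represented by t identifiers, C is a d x t matrix.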

It is assumed at the outset that the mechanism for generating the C matrix can be
defined (that is, the identifiers have been selected and thus $T_1(D)$ can be found). It will
turn out that the C matrix provides the fundamental starting point for all the analysis which
follows.


SECTION  III 


BOOLEAN  ALGEBRAIC  RETRIEVAL 

Perhaps  the  simplest  and  most  widely  used  retrieval  scheme  is  the  Boolean  Algebraic 
technique  (sometimes  called  the  Inverted  Indices  method). 

Here the C matrix is a binary matrix

$C = [C_{ij}]$

where

$C_{ij} = 1$ if identifier j is present in document i, and $0$ otherwise,

and the query vector $q$ has elements

$q_i = 1$ if identifier i is present in query Q

$q_i = x$ if identifier i is not present in Q

$q_i = 0$ if the negation of identifier i is present in Q

Let $S_i$ denote the set of documents which contain identifier i. From the $q$ vector the
retrieval is obtained by intersection of all sets $S_i$ corresponding to $q_i = 1$.*

* Negation must be handled differently since the sets $S_i$ contain only those documents which
contain identifier i. Therefore, if identifier i is negated in the query, we could generate a
new set which contains only those documents not contained in $S_i$. Although this is simple
in theory, the operation is time-consuming in practice.
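Example 1

Let t = 3 and d = 6

$$C = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix} \implies \begin{aligned} S_1 &= \{d_1, d_3, d_4, d_6\} \\ S_2 &= \{d_1, d_2, d_3, d_5\} \\ S_3 &= \{d_3, d_5, d_6\} \end{aligned}$$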


Suppose the retrieval request is:

Retrieve all documents which contain identifier 1 and identifier 3.

Based upon this request $q = [1, x, 1]^T$, i.e., $q_1 = q_3 = 1$.

Since $q_1$ and $q_3$ are 1 we take the logical and of sets $S_1$ and $S_3$:

Subset of $\{D_i\} = S_1 \cap S_3 = \{d_3, d_6\}$

Hence documents 3 and 6 are retrieved.
In this case the retrieval vector $r$ is

$r = [0, 0, 1, 0, 0, 1]^T$


More complicated Boolean expressions can be obtained using union and negation.


Example 2:

Suppose the query request in Example 1 were changed to read:

Retrieve all documents which contain identifiers 1 and 3 or contain identifiers
1 and 2.


In this case the retrieval is accomplished in three steps:

Step 1: Retrieve documents which contain identifiers 1 and 3.
Step 2: Retrieve documents which contain identifiers 1 and 2.
Step 3: Obtain the logical or of the results of step 1 with those of step 2.


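Step 1: From Example 1, $S_1 \cap S_3 = \{d_3, d_6\}$.

Step 2: From the query $q = [1, 1, x]^T$,

Subset of $\{D_i\} = S_1 \cap S_2 = \{d_1, d_3\}$

Step 3:

Subset $= (S_1 \cap S_3) \cup (S_1 \cap S_2) = \{d_3, d_6\} \cup \{d_1, d_3\} = \{d_1, d_3, d_6\}$

Thus, documents 1, 3 and 6 are retrieved and

$r = [1, 0, 1, 0, 0, 1]^T$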
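
The set operations of Examples 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the contents of $S_1$ are inferred from the results quoted in the two examples, and the helper names `retrieve_all` and `retrieval_vector` are hypothetical. The last lines show the negation case from the footnote, formed as the complement of a set within the library.

```python
# Inverted index for Example 1: identifier number -> set of document numbers.
# S[1] is an assumption inferred from the results quoted in Examples 1 and 2.
S = {
    1: {1, 3, 4, 6},
    2: {1, 2, 3, 5},
    3: {3, 5, 6},
}
NUM_DOCS = 6


def retrieve_all(identifiers):
    """Logical AND: documents containing every listed identifier."""
    result = set(range(1, NUM_DOCS + 1))
    for i in identifiers:
        result &= S[i]
    return result


def retrieval_vector(docs):
    """Binary retrieval vector r: entry i is 1 if document i is retrieved."""
    return [1 if i in docs else 0 for i in range(1, NUM_DOCS + 1)]


example_1 = retrieve_all([1, 3])               # identifiers 1 AND 3
example_2 = example_1 | retrieve_all([1, 2])   # (1 AND 3) OR (1 AND 2)

# Negation: complement of S[2] within the library, then AND with S[1].
not_2 = set(range(1, NUM_DOCS + 1)) - S[2]
example_neg = S[1] & not_2                     # identifier 1 AND NOT identifier 2

print(sorted(example_1), retrieval_vector(example_1))
print(sorted(example_2), retrieval_vector(example_2))
print(sorted(example_neg))
```

Every retrieved document receives the same value 1 in the retrieval vector, so nothing in the output ranks one retrieved document above another.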


The main drawback of Boolean Retrieval is that it is a "yes" or "no" technique -
that is, either the query exactly matches a document or else the document is not retrieved.
This is a serious deficiency since it is unlikely that the user of the system would have
the foresight to specify precisely the query corresponding to the documents which are
relevant to him. What is needed is a retrieval vector $r$ which is not binary but rather
contains elements which indicate the relevance of each document to the query. In this way
the documents can be rank ordered according to their relevance in answering the query.
This property will be provided by Linear Statistical Retrieval.


SECTION  IV 

LINEAR  STATISTICAL  RETRIEVAL 


In order to assign relevance numbers to the documents of the library, given the query
vector $q$, the linear statistical model is normally used. Here the retrieval vector $r$ is
obtained by performing a linear transformation on the query vector $q$. The transformation
matrix is the identifier document matrix C (usually a modified C matrix as will be seen).

Binary  C  Matrix 

In the simplest case

$r = C q$

where

$C_{ij} = 1$ if the identifier j is present in document i, and $0$ otherwise

$q_i = 1$ if identifier i is present in query Q, and $0$ otherwise

The relevance of the query to document j would be indicated by the value of

$r_j = \sum_{i=1}^{t} C_{ji} q_i$

Note that $r_j$ is simply the number of identifiers which are present in both
the document and the query.
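
A minimal sketch of the binary linear model follows. The library `C` and query `q` are hypothetical data chosen for illustration, and `linear_retrieval` is an assumed helper name; the code simply evaluates $r = Cq$ row by row and rank orders the documents.

```python
def linear_retrieval(C, q):
    """Return r = C q: one relevance number per document (row of C)."""
    return [sum(c_ij * q_j for c_ij, q_j in zip(row, q)) for row in C]


# Hypothetical binary library: 4 documents (rows), 5 identifiers (columns).
C = [
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
]
q = [1, 0, 1, 0, 0]   # the query contains identifiers 1 and 3

r = linear_retrieval(C, q)

# Rank documents by decreasing relevance (ties keep their original order).
ranking = sorted(range(len(r)), key=lambda i: r[i], reverse=True)

print(r)
print([i + 1 for i in ranking])   # document numbers, most relevant first
```

Each entry of `r` counts the identifiers shared by the query and one document, which is exactly the sum described above.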

There exist at least two serious drawbacks to using this simple linear model. The
first has to do with the fact that the C matrix is binary. Since we are interested in
computing the relevance of a document based upon the query, it would seem that the
elements of the C matrix should reflect the relevance to a document given an identifier
had occurred in the document. That is to say, $C_{ij}$ should be the relevance of identifier j
to document i given that identifier j occurred in document i. The assignment of the $C_{ij}$'s
could be accomplished either manually or algorithmically. If performed manually someone
would have to estimate them at the time the document is stored in the library. The
assignment could be done algorithmically by setting $C_{ij}$ equal to the relative frequency of the
occurrence of identifier j in document i.

The  second  deficiency  of  this  simple  linear  model  has  to  do  with  the  fact  that  the 
binary  relevance  coefficient  reflects  only  the  number  of  identifiers  which  match  in  the 
document  and  the  query,  and  does  not  take  into  account  the  number  of  mismatches. 
For example let

$q$ = 100111000

$d_1$ = 000111000

$d_2$ = 111101111

where $d_i$ is the ith row vector of C. Now

$r_1 = d_1^T q = 3$ and $r_2 = d_2^T q = 3$

so the relevance of document 1 equals the relevance of document 2.

This is certainly counterintuitive since $d_1$ is much closer to the query $q$ than is $d_2$.
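
The example can be checked numerically. The sketch below computes both inner products and, as an added illustration that is not part of the linear model, counts the positions at which each document disagrees with the query; the helper names `bits` and `inner` are hypothetical.

```python
def bits(s):
    """Turn a string such as '100111000' into a list of 0/1 integers."""
    return [int(ch) for ch in s]


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


q = bits("100111000")
d1 = bits("000111000")
d2 = bits("111101111")

r1, r2 = inner(d1, q), inner(d2, q)

# Positions where a document and the query disagree (mismatches).
mismatch1 = sum(a != b for a, b in zip(d1, q))
mismatch2 = sum(a != b for a, b in zip(d2, q))

print(r1, r2)                 # both relevance numbers
print(mismatch1, mismatch2)   # mismatches are ignored by the binary model
```

Both documents score 3, although the first differs from the query in one position and the second in six.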

Weighted  C  Matrix 

Let C be a weighted matrix

$C = [C_{ij}]$

where

$C_{ij}$ = relevance to document i given identifier j

Note that there is no reason why $C_{ij}$ can't be negative. In fact if identifier j never
appears in document i it would seem reasonable that $C_{ij} < 0$.


Examining the linear model

$r = C q$

where $q$ is binary,

$r_i = \sum_{j=1}^{t} C_{ij} q_j$

the relevance of document i to the query $q$ is simply the algebraic sum of the individual
relevance coefficients (i.e., $C_{ij}$'s) which correspond to the identifiers in the query $q$.


There are yet a few deficiencies present in our linear model. One of these involves
the use of a binary query vector. The user may not consider each identifier in his query
vector equally important in which case he may wish to weight the elements of $q$.
Weighted C Matrix and Weighted q Vector

The linear model is

$r = C q$ where C is weighted as before and

$q_i$ = the weight assigned to identifier i.

Our linear model has generalized a good deal; however, there still exists a very
important and fundamental question which has not been answered. This question involves
the constraints upon the weights used in the C matrix and in the q vector. This question
is treated in a vague way in the literature; however, it is of fundamental importance
since the system retrieval will vary widely depending upon its answer.

In  order  to  clarify  (and  answer)  the  problem  of  constraints,  the  linear  retrieval  method 
will  now  be  interpreted  as  an  operation  in  a  linear  vector  space. 

Linear  Vector  Space  Interpretation 
Given the weighted C matrix,

$C = [C_{ij}]$,

the t dimensional row vectors are represented as

$d_i^T = [C_{i1}, C_{i2}, \ldots, C_{it}]$


Here  the  documents  are  represented  as  t  dimensional  vectors  in  the  vector  space 
spanned  by  the  t  identifiers. 


The  query  vector  can  be  represented  in  the  same  space  as  a  t  dimensional  vector. 

The Linear Retrieval Model can now be interpreted as a set of vector operations in
the linear vector space

$r = C q$

thus

$r_i = \sum_{j=1}^{t} C_{ij} q_j$

or equivalently

$r_i = \langle d_i, q \rangle = d_i^T q$

Thus, the relevance of document i to the query $q$ is simply the inner product of the
ith document vector with the query vector. Here it is obvious that the measure of
relevance is directly related to the measure of "closeness" of the document vector to the
query vector in the vector space. At this point one might be tempted to drop the Linear
Model $r = C q$ in favor of using other metrics which measure the "closeness" between
two vectors (such as Euclidean Distance, Box Car Norm, etc.). However, since we are
concerned with analyzing the Linear Model we shall focus our attention on the inner
product as our measure of "closeness" (or relevance).

Example  3: 

Let  d  =  4  and  t  =  2.  That  is,  the  library  contains  4  documents  each  represented  by 
2  identifiers.  Here  the  linear  vector  space  is  spanned  by  2  identifiers  and  so  the  space 
has  dimension  2. 


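$$C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \\ C_{31} & C_{32} \\ C_{41} & C_{42} \end{bmatrix} = \begin{bmatrix} d_1^T \\ d_2^T \\ d_3^T \\ d_4^T \end{bmatrix}$$

$d_1^T = [C_{11}, C_{12}]$

$d_2^T = [C_{21}, C_{22}]$

$d_3^T = [C_{31}, C_{32}]$

$d_4^T = [C_{41}, C_{42}]$

The query vector is similarly 2-dimensional

$q^T = [q_1, q_2]$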


We  can  now  return  to  the  important  question  of  normalization.  It  has  been  shown 
that  the  Linear  Statistical  Model  requires  that  the  inner  product  between  a  document 
vector  and  the  query  vector  be  the  relevance  measure  for  that  document.  The  inner 
product  is  given  by: 

$r_i = d_i^T q = |d_i| \, |q| \cos\theta$

where

$|v|$ = magnitude of the vector $v$

and

$\theta$ is the angle between the vectors.

Notice that the inner product is directly proportional to the product of the magnitudes
of the vectors. Herein lies the normalization problem. A user who assigns large weights
to his query vector may get a completely different response than another user who uses
the same relative weights but uses weights of a much lower magnitude. The documents
will have the same rank ordering; however, the documents retrieved will depend upon
$|q|$. In view of this problem the most reasonable process is to require that the
document vectors and the query vectors be normalized to unit vectors. In this case the
measure of "closeness" (or relevance) is determined only by the cosine of the angle
between the two vectors, i.e.,

$r_i = d_i^T q = \cos\theta$

Now the constraints on the weights of the C matrix and the q vector are specified.
That is

$|d_i| = 1 = \sum_{j=1}^{t} C_{ij}^2$ and

$|q| = 1 = \sum_{j=1}^{t} q_j^2$
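
The effect of unit normalization can be illustrated with the binary vectors used in the earlier mismatch example. The following sketch (helper names `normalize` and `cosine_relevance` are hypothetical) normalizes both vectors before taking the inner product, and also checks that rescaling the query weights no longer changes the relevance number.

```python
import math


def normalize(v):
    """Scale v to unit magnitude so that the sum of squared elements is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_relevance(d_i, q):
    """Inner product of the unit-normalized document and query vectors."""
    return sum(a * b for a, b in zip(normalize(d_i), normalize(q)))


q = [1, 0, 0, 1, 1, 1, 0, 0, 0]
d1 = [0, 0, 0, 1, 1, 1, 0, 0, 0]
d2 = [1, 1, 1, 1, 0, 1, 1, 1, 1]

r1 = cosine_relevance(d1, q)   # 3 / (sqrt(3) * sqrt(4))
r2 = cosine_relevance(d2, q)   # 3 / (sqrt(8) * sqrt(4))

# Same relative query weights at ten times the magnitude.
r1_scaled = cosine_relevance(d1, [10 * x for x in q])

print(round(r1, 3), round(r2, 3), round(r1_scaled, 3))
```

With normalization the first document now scores higher than the second, and the score is independent of the overall magnitude of the query weights.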

At this point we have generalized the Linear Statistical Model to the point where
the C matrix and the $q$ vector are weighted and properly normalized. However, there
still exist many deficiencies in the model. In particular consider the problem of
formulating the query vector. Ideally the user should construct that query which best
matches all the document vectors which are of interest to him. However, he cannot be
expected to know the relevance of each identifier to every document and further, he will
not be expected to assign a weight to every possible query identifier. To meet this
need, some automatic Statistical Association Techniques can be employed to modify a
user's query so as to generate a larger, more comprehensive, query. It will be shown that
the same techniques used to broaden a query can be used to broaden the system response.


SECTION  V 

STATISTICAL  ASSOCIATION  TECHNIQUES 




The central idea behind using Association Techniques (these techniques are sometimes
called "clustering" or "clumping") is to add identifiers to a query by using the pair-wise
statistical relations which exist between identifiers.

Therefore we wish to obtain a t x t matrix which reflects the similarity between the
identifiers. Let S be such a similarity matrix


$S = [S_{ij}]$ (t x t)

Here the ijth element indicates the degree of similarity between the ith identifier
and the jth identifier. There exist many ways of generating similarity matrices but each
method must use the association information inherently contained within the documents
of the library. All this information is contained in the C matrix and so the C matrix is
always used as the starting point. A useful Similarity Matrix is the Covariance Matrix*
defined as


$S = [S_{ij}]$ (t x t)

where

$S_{ij} = \frac{1}{d} \sum_{k=1}^{d} (C_{ki} - \bar{C}_i)(C_{kj} - \bar{C}_j)$

where

$\bar{C}_i = \frac{1}{d} \sum_{k=1}^{d} C_{ki}$

This is the average of the ith column vector of the C matrix.

$S_{ij}$ is simply the covariance between the ith identifier and the jth identifier.
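
As a small illustration, the covariance similarity matrix can be computed directly from the columns of C. The sketch below assumes the $1/d$ scaling and uses a hypothetical 4 x 3 matrix in which identifiers 1 and 2 always occur together and never with identifier 3; `covariance_similarity` is an assumed name.

```python
def covariance_similarity(C):
    """Covariance between identifier columns of C (d documents x t identifiers)."""
    d = len(C)
    t = len(C[0])
    # Average of each column vector of C.
    means = [sum(C[k][i] for k in range(d)) / d for i in range(t)]
    return [
        [sum((C[k][i] - means[i]) * (C[k][j] - means[j]) for k in range(d)) / d
         for j in range(t)]
        for i in range(t)
    ]


# Hypothetical library: identifiers 1 and 2 co-occur; identifier 3 occurs elsewhere.
C = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]

S = covariance_similarity(C)
for row in S:
    print(row)
```

Identifiers that co-occur receive a positive entry and identifiers that exclude one another a negative entry, and the matrix is symmetric.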


* Again the literature is particularly vague on the subject of similarity measures. I
suggest two other possible measures as follows:

1. $S = C^T C$

Here $S_{ij}$ = inner product of the ith column vector of C with the jth column vector


Once the similarity matrix has been generated we can interpret the ijth element as the
strength of the association between identifier i and identifier j. It is convenient to represent
this information as an undirected** graph where the nodes represent the identifiers and the
weight of the links represent the association coefficients between the nodes (i.e.,
identifiers) which they connect. That is, the links on this graph indicate the "strength" of the
1st order associations between nodes. By taking products of these "strengths" along
paths of length two, second order associations can be obtained. For example in Figure 1
a second order association between node i and node k is given by the product

$S_{ij} S_{jk}$


The sum of all second order associations between node i and node j is obtained by
examining the ijth element of the matrix obtained by squaring the Similarity Matrix = $S^2$.
Therefore, second order associations are obtained from

$S^2 = SS = [S^{(2)}_{ij}]$ (t x t)


* Continued

of C. The previous discussion on normalization is pertinent here.

2. Another measure of similarity could be obtained by considering the Euclidean distance
between the identifier vectors (i.e., the column vectors of C) in the space spanned by the
documents. Here we would have t identifier vectors represented in a d-dimensional vector space
spanned by the d documents

S..  -  |  It.  f 1 

where the ith identifier vector is the ith column vector of C

** Note that the graph is undirected since $S_{ij} = S_{ji}$, that is, the association from i to j
equals the association from j to i. If the similarity matrix were not symmetric then the graph
would be directed.


where

$S^{(2)}_{ij}$ = sum of all second order associations between node i and node j


Any order association between identifiers can be obtained by raising the S matrix
to the desired power. That is, the nth order associations are obtained from

$S^n = S \cdot S \cdots S$ (n times)

Returning to the question of expanding our query vector $q$ by using higher order
associations between identifiers, it is clear that we must have some means of taking into
account the relative importance of the different order associations. It would seem
reasonable that 1st order associations are more important than second which in turn are more
important than third and so on. To accomplish this, the following method is suggested
by Salton.

Let $q^*$ = expanded query vector

$q$ = original query vector

$S$ = similarity matrix

$a$ = positive constant less than one, $0 < a < 1$

then define

$q^* = [I + aS + (aS)^2 + (aS)^3 + \cdots] \, q$

Here we have weighted the higher order associations by the appropriate power of a
and since 0 < a < 1, $a^n$ is monotonically decreasing as n increases.

Example  4:  Second  Order  Association. 

For the purpose of simplicity, suppose that the similarity matrix has been thresholded
at the level $\theta$.

That is

$S^{threshold}_{ij} = 1$ if $S_{ij} \ge \theta$, and $0$ if $S_{ij} < \theta$

Then the similarity matrix is a binary matrix.
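Let

$$S_{threshold} = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \end{bmatrix}$$

the resulting graph has nodes 1 through 5 with links 1-2, 1-3, 3-4, 3-5 and 4-5 (each node
is also associated with itself).

The presence of a link indicates that the connected nodes are associated in $S_{threshold}$.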

Now

$$S^2 = SS = \begin{bmatrix} 3 & 2 & 2 & 1 & 1 \\ 2 & 2 & 1 & 0 & 0 \\ 2 & 1 & 4 & 3 & 3 \\ 1 & 0 & 3 & 3 & 3 \\ 1 & 0 & 3 & 3 & 3 \end{bmatrix}$$

$S^2$ yields the second order associations between pairs of nodes. Since $S_{threshold}$
is binary the ij element of the $S^2$ matrix is simply the number of paths of length two
between node i and node j. This may be verified by examination of the graph.


Once the expanded query vector is obtained (i.e., $q^*$ above) it must be normalized
such that $|q^*| = 1$.
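
The matrix of Example 4 and a truncated query expansion can be sketched as follows. The expansion assumes the series has the form $q^* = [I + aS + (aS)^2 + \cdots]q$ and is cut off after a chosen order; the truncation, the value a = 0.5, the query, and the function names are all assumptions made for illustration.

```python
import math


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matvec(A, v):
    """Product of a matrix and a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


# Thresholded similarity matrix of Example 4.
S = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
S2 = matmul(S, S)   # entry (i, j) counts paths of length two from i to j


def expand_query(S, q, a, order):
    """Truncated series q + (aS)q + ... + (aS)^order q, normalized to unit length."""
    expanded = list(q)
    term = list(q)
    for _ in range(order):
        term = [a * x for x in matvec(S, term)]   # next power of (aS) applied to q
        expanded = [e + x for e, x in zip(expanded, term)]
    norm = math.sqrt(sum(x * x for x in expanded))
    return [x / norm for x in expanded]


q = [0, 1, 0, 0, 0]   # hypothetical query mentioning identifier 2 only
q_star = expand_query(S, q, a=0.5, order=2)

print(S2)
print([round(x, 3) for x in q_star])
```

In this sketch identifier 3 enters the expanded query only through the second order term, since it is linked to identifier 2 by a path of length two through identifier 1.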


The exact same techniques used for expanding the query vector can be used to expand
the set of retrieved documents. Suppose that Linear Statistical Retrieval is used so that

$r = C q$

Remember that the elements of the $r$ vector are the relevance indicators for the
documents, that is

$r_i$ = relevance of document i
Now a new relevance vector can be obtained by

$r^* = [I + aS_d + (aS_d)^2 + (aS_d)^3 + \cdots] \, r$

where $S_d$ is a d x d similarity matrix defined on the documents. The ijth element of
$S_d$ is the association of the ith document to the jth document.

Another very interesting way of expanding a retrieval set is obtained by using
bibliographic citations (see reference 1). The mechanism for doing this is quite similar to
the methods used for association.

We begin by obtaining a d x d binary matrix M where the documents are chronologically
ordered along the rows and columns. That is

$M = [M_{ij}]$, with rows and columns indexed by $d_1, d_2, \ldots, d_d$

where

$d_1$ is the oldest document,
$d_2$ the next oldest and so on.

The rows represent the documents being cited and the columns the source of the
citation. Therefore

$M_{ij} = 1$ if document j cites document i, and $0$ otherwise.

The  elements  on  or  below  the  diagonal  are  zero  since  a  document  can  only  cite  a 
previously  published  document  and  further  no  document  cites  itself.  Now  proceeding  as 
before,  higher  order  linkages  can  be  examined  by  taking  higher  order  powers  of  the  M 
matrix. 

For example taking the nth power of the M matrix and examining the ij element of
$M^n$ (i < j and j > n)* we can obtain the sum of nth order linkages between document
i and document j.
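
A small sketch of the citation linkage computation is given below, using hypothetical citation data for four chronologically ordered documents. Powers of the strictly upper triangular matrix M count citation chains of the corresponding length.

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical data: M[i][j] = 1 if document j+1 cites document i+1.
# Document 2 cites 1; document 3 cites 1 and 2; document 4 cites 3.
M = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

M2 = matmul(M, M)    # second order linkages (chains of two citations)
M3 = matmul(M2, M)   # third order linkages (chains of three citations)

print(M2)
print(M3)
```

Here `M3` has a single non-zero entry, linking document 1 to document 4 through the chain in which 4 cites 3, 3 cites 2, and 2 cites 1.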

* It is easy to show that $M^{(n)}_{ij} = 0$ for $j \le n$, where $M^n = [M^{(n)}_{ij}]$.
Now by examining the $M^n$ matrix, all documents which exhibit strong nth order links
are collected into groups. Now we can expand the retrieved document set by adding the
document groups which are strongly linked to the originally retrieved documents.

For  an  example  of  a  proposed  Linear  Statistical  Retrieval  System  see  reference  2. 

A  Statistical  Viewpoint 

Viewing the problem of information retrieval from a statistical point of view we would
like to compute the probability of a document being relevant given the query vector $q$. That
is to say we should like to compute

$P(d_i/q)$ = Probability that document i is relevant given query $q$

Following the usual procedure we employ Bayes Rule to get

$P(d_i/q) = \frac{P(q/d_i) \, P(d_i)}{P(q)}$

Note that in theory $P(q/d_i)$ could be estimated. $P(q/d_i)$ is the probability of query
$q$ given that document i is relevant. We could accomplish this by observing the relative
frequency of the $q$ vector under the condition that document i is relevant to the user
generating $q$. Of course this procedure would have to be done many times for all possible
query vectors. Clearly this is impossible in any practical sense.

$P(d_i)$ could be estimated by the relative frequency with which document i is
considered relevant.

$P(q)$ is a constant given any query and therefore poses no problem of estimation.

In order to simplify our problem let us assume that the identifiers composing the
$q$ vector are statistically independent. In this case

$P(q/d_i) = \prod_{k=1}^{t} P(q_k/d_i)$

Now

$P(d_i/q) = \frac{\left[ \prod_{k=1}^{t} P(q_k/d_i) \right] P(d_i)}{\prod_{k=1}^{t} P(q_k)} = (\text{const.}) \, P(d_i) \prod_{k=1}^{t} P(q_k/d_i)$

Since the log function is a monotonic function of its argument we can use log
$P(d_i/q)$ to estimate the relevance of document i. Taking the log

$\log P(d_i/q) = \text{const.} + \log P(d_i) + \sum_{k=1}^{t} \log P(q_k/d_i)$
Now it is assumed that we can estimate the $P(q_k/d_i)$ in much the same way we
obtained the weights in the C matrix. Given a $q$ vector we can get $P(q_k/d_i)$ and the
relevance factor can be obtained as a linear function.

$r_i = \log P(d_i/q) = \text{const.} + \log P(d_i) + \sum_{k=1}^{t} \log P(q_k/d_i)$

It  should  be  noted  that  this  simple  linear  relation  is  obtained  under  the  assumption 
of  statistical  independence.  For  further  discussion  of  statistical  techniques  see 
reference 3.
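
The log-linear relevance factor can be sketched as follows. The priors and conditional probabilities are hypothetical numbers chosen for illustration, the constant term is dropped because it is the same for every document, and `log_relevance` is an assumed name.

```python
import math


def log_relevance(prior, cond_probs):
    """log P(d_i) + sum over k of log P(q_k/d_i); the constant term is dropped."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)


# Hypothetical estimates: (P(d_i), [P(q_k/d_i) for each identifier in the query]).
docs = {
    "d1": (0.5, [0.8, 0.5, 0.4]),
    "d2": (0.5, [0.2, 0.5, 0.1]),
}

scores = {name: log_relevance(prior, cond) for name, (prior, cond) in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

print(scores)
print(ranked)
```

Because the logarithm is monotonic, ranking by these sums gives the same order as ranking by the product $P(d_i) \prod_k P(q_k/d_i)$. A probability estimate of exactly zero would have to be handled separately, since its logarithm is undefined.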

Before  going  on  to  other  topics  it  is  worth  noting  that  the  highest  order  statistics 
considered  thus  far  are  only  second  order.  Even  though  higher  order  associations  were 
employed they were generated taking account only of second order statistical
relationships.
SECTION VI

VECTOR  SPACE  REPRESENTATION 


The  vector  space  representation  has  already  been  given  where  the  d  documents  are 
represented  as  t  dimensional  vectors  in  the  space  spanned  by  the  t  identifiers.  Similarly 
the  query  vector  is  represented  as  a  t  dimensional  vector  in  the  same  space.  If  we  were 
to  implement  the  Linear  Statistical  System  previously  described  we  would  have  to  store 
the d t-dimensional document vectors. This would require d x t numbers. Typically

d = 500,000 documents
t = 1000 - 10,000 identifiers.

$\therefore d \times t = 5 \times 10^8$ to $5 \times 10^9$ numbers.

If each coordinate were represented by 5 bits, the system would have to store up to
$2.5 \times 10^{10}$ bits to represent d documents in the t dimensional vector space.
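
The storage arithmetic above can be reproduced with a short calculation; `storage_bits` is a hypothetical helper name.

```python
def storage_bits(num_documents, num_identifiers, bits_per_coordinate):
    """Bits needed to store every coordinate of every document vector."""
    return num_documents * num_identifiers * bits_per_coordinate


d = 500_000
low = storage_bits(d, 1_000, 5)     # t = 1000 identifiers
high = storage_bits(d, 10_000, 5)   # t = 10,000 identifiers

print(f"{low:.1e} to {high:.1e} bits")
```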

Because  of  the  size  of  the  required  storage  we  are  motivated  to  search  for  lower 
dimensional  vector  spaces  in  which  we  can  represent  the  document  vectors  and  still 
perform  meaningful  retrieval. 

One possibility for accomplishing this function is to find a K dimensional subspace
of the t dimensional vector space such that the K-space is the "best" K dimensional space
in the least squares sense.

The  solution  to  this  problem  is  well  known  in  the  field  of  linear  algebra.  It  turns  out 
that  the  solution  is  given  by  the  K  eigenvectors  corresponding  to  the  K  largest  eigenvalues 
of  the  covariance  matrix  defined  earlier. 


Let

$S = [S_{ij}]$ (t x t) = Covariance Matrix

where

$S_{ij} = \frac{1}{d} \sum_{k=1}^{d} (C_{ki} - \bar{C}_i)(C_{kj} - \bar{C}_j)$

$\bar{C}_i = \frac{1}{d} \sum_{k=1}^{d} C_{ki}$

where

$C = [C_{ij}]$ (d x t) = document identifier matrix

Then, solving the following eigenvector problem yields the appropriate K eigenvectors

$S e = \lambda e$

where $e$ is a t dimensional eigenvector.
Having solved this problem we then define a linear transformation A as follows

$A = [e_1, e_2, \ldots, e_K]$ (t rows, K columns)

where the column vectors are the K eigenvectors obtained above.

Now the document vectors are projected into the K dimensional subspace by the
linear transformation A, i.e.

$C' = C A$

Since C is d x t and A is t x K, C' is d x K. Therefore, we need to represent each
of the d document vectors in the K space by only K numbers in lieu of the t numbers we
originally needed. Since K < t we have saved storage. The typical savings might be a
factor of 1000.

Now when a query vector is generated we map it into the K space by

$q' = A^T q$ where $q'$ is K x 1

Retrieval is accomplished in the K-space just as before, i.e.

$r = C' q'$
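
The reduction to a K-space can be sketched end to end. The eigenvectors below are found by power iteration with deflation, which is a numerical method substituted here for illustration and is not prescribed by the text; in practice a library eigenvalue routine would normally be used. The matrix `C`, the query `q`, and the function names are hypothetical, and the covariance uses the $1/d$ scaling.

```python
import math


def covariance(C):
    """Covariance matrix S (t x t) of the columns of C (d x t), with 1/d scaling."""
    d, t = len(C), len(C[0])
    mean = [sum(row[i] for row in C) / d for i in range(t)]
    return [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in C) / d
             for j in range(t)] for i in range(t)]


def top_eigenvectors(S, K, iterations=200):
    """K leading eigenvectors of symmetric S via power iteration with deflation."""
    t = len(S)
    S = [row[:] for row in S]                      # work on a copy
    vectors = []
    for _ in range(K):
        e = [1.0 + 0.1 * i for i in range(t)]      # arbitrary non-zero start
        norm = math.sqrt(sum(x * x for x in e))
        e = [x / norm for x in e]
        for _ in range(iterations):
            w = [sum(S[i][j] * e[j] for j in range(t)) for i in range(t)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:                        # no variance left to capture
                break
            e = [x / norm for x in w]
        lam = sum(e[i] * S[i][j] * e[j] for i in range(t) for j in range(t))
        for i in range(t):                         # deflate: S <- S - lam e e^T
            for j in range(t):
                S[i][j] -= lam * e[i] * e[j]
        vectors.append(e)
    return vectors


# Hypothetical library: 4 documents, t = 3 identifiers, reduced to K = 2.
C = [[2, 2, 0], [0, 0, 0], [2, 2, 1], [0, 0, 1]]
t, K = 3, 2
E = top_eigenvectors(covariance(C), K)
A = [[E[k][i] for k in range(K)] for i in range(t)]                  # t x K
C_proj = [[sum(row[i] * A[i][k] for i in range(t)) for k in range(K)]
          for row in C]                                              # C' = C A
q = [1, 1, 0]
q_proj = [sum(A[i][k] * q[i] for i in range(t)) for k in range(K)]   # q' = A^T q
r = [sum(c * x for c, x in zip(row, q_proj)) for row in C_proj]      # r = C' q'

print([[round(x, 3) for x in row] for row in C_proj])
print([round(x, 3) for x in r])
```

In this small case each document is stored with two numbers instead of three, and the relevance numbers agree with those of $Cq$ in the original space because the chosen query lies entirely within the retained subspace.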

Note that the vectors need not be re-normalized in the K-space since A is an orthogonal
transformation, which means that the vector magnitudes are invariant under this transformation.*


*To show this let

$$ y = A x $$

The squared magnitudes are $y^T y$ and $x^T x$:

$$ y^T y = x^T A^T A x $$

but since A is orthogonal, $A^T = A^{-1}$ and $A^T A = I$, so that

$$ y^T y = x^T I x = x^T x $$

Strictly, A is t x K with orthonormal columns, so $A^T A = I$ holds but $A A^T = I$ does not
unless K = t; under the projection $q' = A^T q$ the magnitude is therefore preserved exactly
only for vectors lying in the K-space.

Another interesting method for reducing the dimensionality of the vector space has
been proposed by Assorio*. Assorio begins the problem by grouping the documents in
the library into a number of fields. He then asks several experts working in a particular
field to generate a t dimensional vector which typifies that field. This is accomplished
by each expert going through all t identifiers and ranking their importance (or relevance)
to his field. An average vector for each field is then obtained by averaging together
the vectors generated by each expert in that field. If there are f fields, this process will
generate f t-dimensional vectors.
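
The averaging step might look as follows. The field names, the rating scale and the ratings themselves are invented for illustration; the report does not specify how the experts score the identifiers.

```python
def average_field_vector(expert_vectors):
    """Average the t dimensional rating vectors produced by the experts of one field."""
    n = len(expert_vectors)
    t = len(expert_vectors[0])
    return [sum(v[i] for v in expert_vectors) / n for i in range(t)]

# Hypothetical ratings: t = 4 identifiers, two fields, two and three experts.
ratings = {
    "physics":   [[5, 1, 0, 2], [3, 1, 2, 2]],
    "chemistry": [[0, 4, 4, 1], [2, 2, 4, 1], [1, 3, 4, 1]],
}
field_vectors = {name: average_field_vector(vs) for name, vs in ratings.items()}

# Arrange the averages as the columns of a t x f matrix (t rows, f columns).
fields = list(field_vectors)
F = [[field_vectors[name][i] for name in fields] for i in range(4)]
```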

The underlying concept here is that the experts within a field will draw upon their
knowledge and experience to generate a "good" representative vector for that field.
Averaging the expert vectors within the field then further smooths out the effects of
each individual. The result should then be the "best" t dimensional vector for that
field. Here "best" means that the resultant field vector fits all the document vectors
in the "best" way possible.

We can represent Assorio's information at this point by the F matrix

$$ F = [\, \mathbf{f}_1 \;\; \mathbf{f}_2 \;\; \cdots \;\; \mathbf{f}_f \,] \qquad (t \times f) $$

where the column vectors are the average field vectors.

Now since the dimensionality of the space spanned by the field vectors cannot exceed
min(t, f), and typically f << t, we can represent our document vectors and query vectors
in an f dimensional space. It may still be possible to solve our problem in an even
smaller dimensional space by using the least squares subspace fit as described earlier.
Assorio accomplishes a similar subspace fit using Factor Analysis.

In any case, if the f dimensional space defined by the f field vectors is not reduced
further, a Schmidt Orthogonalization procedure should be used in order to define an
orthogonal basis for the f-dimensional subspace used to represent the d documents.
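
The orthogonalization can be sketched as below. This is the "modified" Gram-Schmidt ordering, chosen here for numerical stability rather than taken from the report, and the two field vectors and the document vector are made-up values.

```python
import math

def gram_schmidt(vectors, tol=1e-12):
    """Return an orthonormal basis for the span of the given vectors.
    Vectors that are (numerically) dependent on earlier ones are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(x * y for x, y in zip(w, b))    # component along b
            w = [x - coeff * y for x, y in zip(w, b)]   # remove it
        norm = math.sqrt(sum(x * x for x in w))
        if norm > tol:
            basis.append([x / norm for x in w])
    return basis

def coordinates(doc, basis):
    """Coordinates of a t dimensional document vector in the orthonormal basis."""
    return [sum(x * y for x, y in zip(doc, b)) for b in basis]

# Hypothetical field vectors (f = 2, t = 4) and one document vector.
field_vectors = [[1.0, 1.0, 0.0, 0.0],
                 [1.0, 0.0, 1.0, 0.0]]
basis = gram_schmidt(field_vectors)
doc_reduced = coordinates([2.0, 0.0, 1.0, 3.0], basis)   # f numbers instead of t
```

Because the basis is orthonormal, inner products between reduced vectors can be computed directly from their f coordinates.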

Using this technique we have reduced the required storage from d x t down to d x f,
which typically is a reduction by a factor of 1000.

SECTION VII

DISCRIMINANT  ANALYSIS 

Up to this point our concern in dimension reduction has focused on fitting the
document vectors in a subspace of the t dimensional space.

Here the emphasis changes abruptly. Our concern in this section is to find a p - 1
dimensional subspace which is optimal for discriminating between p groups (or classes)
of documents. This problem is discussed by Williams [5] and its solution is classically
obtained using Discriminant Analysis.

To begin a discussion on Discriminant Analysis it is best to select a simple case so
that the basic ideas are not clouded by the algebra. Therefore, assume that p = 2, that
is, there are two groups of document vectors located in the t dimensional space.


Now we wish to project these two groups orthogonally onto a line so that the variation
between the projected groups is as large as possible, relative to the variation within
the two projected groups. The problem is to find the direction of projection which will
accomplish this. It will turn out that this is equivalent to finding that direction of
projection which maximizes the distance between the projected means relative to the sum of
the variabilities of the projected groups.

Definitions:

$\mu_1$ = mean vector of group 1 (t x 1)
$\mu_2$ = mean vector of group 2 (t x 1)
$\Delta = \mu_1 - \mu_2$ (t x 1)
W = within Groups Scatter Matrix (t x t)
B = between Groups Scatter Matrix (t x t)
S = pooled Scatter Matrix (t x t)
$X^{(i)}$ = data matrix for group i, i = 1, 2, with mean $\mu_i$ subtracted from each column

Now

$$ W = W^{(1)} + W^{(2)} $$

where

$$ W^{(g)} = \sum_{i=1}^{N_g} (x_i^{(g)} - \mu_g)(x_i^{(g)} - \mu_g)^T = X^{(g)} X^{(g)T}, \qquad g = 1, 2 $$

so that

$$ W = X^{(1)} X^{(1)T} + X^{(2)} X^{(2)T} $$

Here $x_i^{(g)}$ is the i-th document vector of group g and $N_g$ is the number of documents
in group g. Notice that $X^{(g)} X^{(g)T}$ differs from the covariance matrix of group g only
by a $1/(N_g - 1)$ normalizing factor.

The S matrix is computed in a similar fashion by first pooling both groups together,
then computing the covariance matrix $\Sigma$ of the pooled data. That is,

$$ S = (N_1 + N_2 - 1) \, \Sigma $$

Now

$$ S = B + W $$

so that B can be computed as

$$ B = S - W $$
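
The scatter matrices can be computed for a small example as a check on the bookkeeping. The two groups below are invented data with t = 2, and the helper names are hypothetical. The final comparison uses the standard two-group identity $B = K \Delta \Delta^T$ with $K = N_1 N_2 / (N_1 + N_2)$; that value of K is an assumption of this sketch and is not stated in the report.

```python
def mean(rows):
    """Mean vector of a list of document vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def scatter(rows, center):
    """Sum over documents of (x - center)(x - center)^T."""
    t = len(center)
    return [[sum((r[i] - center[i]) * (r[j] - center[j]) for r in rows)
             for j in range(t)] for i in range(t)]

def add(P, Q):
    return [[p + q for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)]

def sub(P, Q):
    return [[p - q for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)]

# Hypothetical groups of document vectors (one document per row).
group1 = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]   # N1 = 4
group2 = [[4.0, 5.0], [5.0, 4.0], [6.0, 6.0]]               # N2 = 3

mu1, mu2 = mean(group1), mean(group2)
W = add(scatter(group1, mu1), scatter(group2, mu2))   # within groups scatter
pooled = group1 + group2
S = scatter(pooled, mean(pooled))     # equals (N1 + N2 - 1) times the pooled covariance
B = sub(S, W)                         # between groups scatter

# Cross-check against the rank-one form (assumed identity, see lead-in).
delta = [a - b for a, b in zip(mu1, mu2)]
K = len(group1) * len(group2) / (len(group1) + len(group2))
B_rank_one = [[K * di * dj for dj in delta] for di in delta]
```

For these data W = [[3, 1], [1, 3]], S = [[24, 22], [22, 24]] and B = [[21, 21], [21, 21]], which agrees with the rank-one form.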

For the case where there are only two groups, B takes the simple form

$$ B = K \Delta \Delta^T \qquad (K = \text{const.}) $$

The formal statement of the problem is: Find a direction d (a t x 1 vector) which maximizes
the projected between class scatter for a fixed value of the projected within class scatter.
It is easy to show that the projected scatters are given by:

$$ d^T B d = \text{projected between class scatter} $$

$$ d^T W d = \text{projected within class scatter} \qquad (1) $$

We wish to maximize $d^T B d$ under the constraint that $d^T W d$ remain constant.
This is conveniently handled using Lagrange multipliers:

$$ \frac{\partial}{\partial d} \left[ d^T B d - \lambda ( d^T W d - \text{const.} ) \right] = 0 \qquad (2) $$

where $\lambda$ is the Lagrange multiplier. This gives

$$ B d - \lambda W d = 0 $$

or

$$ ( B - \lambda W ) d = 0 \qquad (3) $$

In order for a non-trivial solution to exist (i.e. other than d = 0) the determinant of
$( B - \lambda W )$ must vanish:

$$ | B - \lambda W | = 0 $$

This problem is recognized as the generalized form of an eigenvector problem where
$\lambda$ is an eigenvalue.

Now extending the problem to the case of p groups, the discriminant analysis solution
will result in the identical eigenvector problem, where the (p - 1) eigenvectors span the
desired optimal subspace for discrimination. See Wilks [6] (p. 576) for further discussion.

For the special case of two groups the solution d can be found directly from equation
2 by substituting $B = K \Delta \Delta^T$ (K = const.) into the expression $d^T B d$, i.e.

$$ d^T B d = K \, d^T \Delta \Delta^T d = K ( \Delta^T d )^2 $$

$$ \frac{\partial}{\partial d} \left[ K ( \Delta^T d )^2 - \lambda ( d^T W d - \text{const.} ) \right] = 0 $$

$$ ( K \Delta^T d ) \, \Delta - \lambda W d = 0 $$

$$ d = \alpha W^{-1} \Delta $$

where

$$ \alpha = \text{const.} = \frac{K \Delta^T d}{\lambda} $$
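
The two-group result can be checked numerically with a short sketch. The values of W, $\Delta$ and K below are assumed for illustration (they correspond to a made-up example with t = 2 and group sizes 4 and 3), the scale factor $\alpha$ is set to 1 since only the direction matters, and the linear system is solved by ordinary Gaussian elimination rather than by forming $W^{-1}$.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in M[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("W is singular; the direction is not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Assumed inputs: within groups scatter, difference of means, and K.
W = [[3.0, 1.0], [1.0, 3.0]]
delta = [-3.5, -3.5]
K = 12 / 7

d_vec = solve(W, delta)                          # d = W^{-1} delta  (alpha = 1)
B = [[K * di * dj for dj in delta] for di in delta]

# The ratio of projected scatters is the eigenvalue lambda of equation 3.
lam = dot(d_vec, matvec(B, d_vec)) / dot(d_vec, matvec(W, d_vec))
residual = [bd - lam * wd for bd, wd in zip(matvec(B, d_vec), matvec(W, d_vec))]
```

Here d comes out as (-0.875, -0.875) and the residual of $( B - \lambda W ) d$ is zero to rounding error with $\lambda$ = 10.5, so the direct solution satisfies the generalized eigenvector problem.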


Therefore, the direction d which solves the discrimination problem between two groups
is obtained.